The journey to high-performance kernels begins with a transition from operation-centric programming (PyTorch Eager) to hardware-aware programming. Triton is the critical bridge on that path.
1. Defining the Stack
Triton is a language and compiler for parallel programming, designed to make it practical to write high-performance custom compute kernels in Python syntax. It occupies a unique middle ground:
- PyTorch Eager: High abstraction, easy to use, but limited control over hardware utilization.
- CUDA C++: Maximum control, but high complexity (manual management of shared memory and synchronization).
- Triton: Pythonic syntax with block-level (tiled) control.
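To make the middle ground concrete, here is a plain-Python sketch of how a Triton-style kernel is organized: a grid of independent "programs", each handling one block of the data. The names (`pid`, `BLOCK_SIZE`, offsets, mask) mirror Triton's vocabulary (`tl.program_id`, `tl.arange`, masked `tl.load`/`tl.store`), but nothing below requires a GPU or the `triton` package; it only illustrates the structure.

```python
def add_kernel_sim(x, y, out, BLOCK_SIZE):
    """Simulate a Triton-style vector-add kernel on plain Python lists."""
    n = len(x)
    # Grid size: one "program" per block of the input (ceiling division).
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE
    for pid in range(num_programs):  # on a GPU, each pid runs in parallel
        # Offsets this program owns; min() plays the role of Triton's
        # out-of-bounds mask on the last, possibly partial, block.
        start = pid * BLOCK_SIZE
        stop = min(start + BLOCK_SIZE, n)
        for i in range(start, stop):
            out[i] = x[i] + y[i]
```

In real Triton, the inner loop disappears: each program loads its whole block as a vector, adds, and stores, and the compiler maps that block onto threads.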
2. The Tiled Paradigm
Unlike CUDA, where the programmer reasons about individual threads, Triton uses a block-based (tiled) programming model: you write one program per tile of data, and the compiler handles the thread-level details within each block (memory coalescing, shared-memory staging, synchronization). This is especially relevant for Deep Learning, where data (matrices, attention maps) is naturally structured in blocks.
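The tiled paradigm is easiest to see in matrix multiplication, the workload it was designed around. The sketch below, in plain Python (no GPU or Triton dependency), computes C = A @ B tile by tile: each (i0, j0) pair corresponds to one program owning a block of C, accumulating over K in block-sized steps, much as a Triton matmul kernel would with `tl.dot`.

```python
def tiled_matmul(A, B, BLOCK):
    """Blocked matrix multiply over lists of lists: C = A @ B."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Each (i0, j0) pair is one "program" owning a BLOCK x BLOCK tile of C.
    for i0 in range(0, M, BLOCK):
        for j0 in range(0, N, BLOCK):
            # Sweep the shared K dimension in block-sized steps; on a GPU,
            # each A/B tile would be staged in fast on-chip memory and reused.
            for k0 in range(0, K, BLOCK):
                for i in range(i0, min(i0 + BLOCK, M)):
                    for j in range(j0, min(j0 + BLOCK, N)):
                        for k in range(k0, min(k0 + BLOCK, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The payoff of tiling is reuse: every element of an A or B tile is used BLOCK times once loaded, which is exactly what makes on-chip memory effective.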
3. The Performance Fallacy
A common misconception is that Triton is just "PyTorch but faster." In reality, it is a separate paradigm. Performance gains come from the developer's ability to eliminate bottlenecks, above all the "Memory Wall": the gap between how fast a GPU can compute and how fast it can move data to and from global memory. The main lever is fusing operations so intermediate results stay in fast on-chip SRAM instead of making round trips through slow global memory.
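The effect of fusion can be made concrete with simple bookkeeping. The sketch below counts global-memory transactions for `out = relu(x + y)` over n elements, under the simplifying assumption that each element read or write costs one transaction; the function names and the 1-transaction-per-element model are illustrative, not from any library.

```python
def unfused_traffic(n):
    """Two separate kernels, intermediate written to global memory."""
    # Kernel 1: tmp = x + y      -> reads x and y (2n), writes tmp (n)
    # Kernel 2: out = relu(tmp)  -> reads tmp (n), writes out (n)
    return (2 * n + n) + (n + n)  # 5n transactions total

def fused_traffic(n):
    """One fused kernel, intermediate kept in registers/SRAM."""
    # out = relu(x + y) -> reads x and y (2n), writes out (n);
    # the sum never touches global memory.
    return 2 * n + n  # 3n transactions total
```

For a memory-bound elementwise chain like this, the fused version moves 3n elements instead of 5n, a ceiling of roughly 1.7x speedup regardless of how fast the arithmetic units are; longer chains widen the gap further.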